10 research outputs found

    Feature selection for high dimensional imbalanced class data using harmony search

    Misclassification costs of minority-class data in real-world applications can be very high. The problem is especially challenging when the data is also high-dimensional, because of increased overfitting and lower model interpretability. Feature selection has recently become a popular way to address this problem by identifying the features that best predict the minority class. This paper introduces a novel feature selection method called SYMON, which uses symmetrical uncertainty and harmony search. Unlike existing methods, SYMON uses symmetrical uncertainty to weigh features with respect to their dependency on the class labels. This helps to identify features that are powerful in retrieving the least frequent class labels. SYMON also uses harmony search to formulate the feature selection phase as an optimisation problem and select the best possible combination of features. The proposed algorithm is able to deal with situations where a set of features have the same weight by incorporating two vector tuning operations embedded in the harmony search process. In this paper, SYMON is compared against various benchmark feature selection algorithms developed to address the same issue. Our empirical evaluation on different microarray data sets using the G-Mean and AUC measures confirms that SYMON is comparable to or better than the current benchmarks.
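
    For illustration only, the symmetrical uncertainty weighting mentioned above can be computed as SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). The sketch below is not the SYMON implementation; it assumes discrete (or pre-discretised) feature values and omits the harmony search stage entirely.

```python
# Illustrative sketch only (not the SYMON implementation): symmetrical
# uncertainty as a feature weight, assuming discrete / pre-discretised values.
import numpy as np

def entropy(values):
    """Shannon entropy (in bits) of a discrete 1-D sequence."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """Shannon entropy of the joint distribution of two discrete sequences."""
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0
    mutual_info = h_x + h_y - joint_entropy(x, y)   # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)

# Weight each feature of a toy discrete dataset against the class labels.
X = np.array([[0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
y = np.array([1, 0, 1, 0])
print([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
```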

    Studying crowdsourcing using machine learning and optimisation-based approaches

    This study addressed the challenge of optimising cost and improving accuracy on microtask crowdsourcing platforms. The research introduced a new method that optimises task assignment, reduces payments to spammers, and avoids collecting their answers.

    Optimizing microtask assignment on crowdsourcing platforms using Markov chain Monte Carlo

    Microtasking is a type of crowdsourcing, denoting the act of breaking a job into several tasks and allocating them to multiple workers to complete. The assignment of tasks to workers is a complex decision-making process, particularly when budget and quality constraints are considered. While there is a growing body of knowledge on the development of task assignment algorithms, current algorithms suffer from shortcomings including: after-worker quality estimation, meaning that workers need to complete all tasks before their quality can be estimated; and one-off quality estimation, which estimates workers' quality only at the start of microtasking using a set of pre-defined quality-control tasks. To address these shortcomings, we propose a Markov chain Monte Carlo based task assignment approach, known as MCMC-TA, which provides iterative estimation of workers' quality and dynamic task assignment. Specifically, we apply a Gaussian mixture model (GMM) to estimate workers' quality and Markov chain Monte Carlo to shortlist workers for task assignment. We use the Google Fact Evaluation dataset to measure the performance of MCMC-TA and compare it against state-of-the-art algorithms in terms of AUC and F-Score. The results show that the proposed MCMC-TA algorithm not only outperforms the rival algorithms but also offers a spammer-resistant result that maximizes the learning of workers' quality with minimal budget.
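
    As a rough illustration of one ingredient of this approach, the sketch below fits a two-component Gaussian mixture model to hypothetical per-worker quality scores and keeps the higher-quality component as a shortlist. It is not the MCMC-TA algorithm: the MCMC shortlisting and the iterative re-estimation loop are omitted, and the scores are made up.

```python
# Minimal sketch of one ingredient only: a Gaussian mixture model separating
# worker quality scores into "reliable" and "spammer-like" groups.
# Not the MCMC-TA algorithm; quality scores here are hypothetical.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical running accuracy estimates for 20 workers (fraction of their
# answered tasks judged correct so far).
quality = np.concatenate([
    rng.normal(0.85, 0.05, size=12),   # diligent workers
    rng.normal(0.50, 0.05, size=8),    # random-guessing (spammer-like) workers
]).clip(0.0, 1.0).reshape(-1, 1)

# Two-component GMM over the 1-D quality scores.
gmm = GaussianMixture(n_components=2, random_state=0).fit(quality)
labels = gmm.predict(quality)

# Treat the component with the higher mean as the reliable group; only those
# workers would be shortlisted for further task assignment.
reliable_component = int(np.argmax(gmm.means_.ravel()))
shortlist = np.where(labels == reliable_component)[0]
print("shortlisted workers:", shortlist)
```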

    Bee colony based worker reliability estimation algorithm in microtask crowdsourcing

    Estimation of worker reliability on microtask crowdsourcing platforms has gained attention from many researchers. On microtask platforms no worker is fully reliable for a task, and it is likely that some workers are spammers, in the sense that they provide random answers to collect the financial reward. The existence of spammers is harmful, as they increase the cost of microtasking and negatively affect the answer aggregation process. Hence, to discriminate between spammers and non-spammers, one needs to measure worker reliability in order to predict how likely it is that a worker puts effort into solving a task. In this paper we introduce a new reliability estimation algorithm based on the bee colony algorithm, called REBECO. The algorithm relies on a Gaussian process model to estimate the reliability of workers dynamically. Of the bees that go in search of pollen, some are more successful than others. This maps well to our problem, where some workers (i.e., bees) are more successful than other workers for a given task, giving rise to a reliability measure. Answer aggregation with respect to worker reliability rates has been considered a suitable replacement for conventional majority voting. We compared REBECO with majority voting using two real-world datasets. The results indicate that REBECO is able to outperform majority voting significantly.
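
    To illustrate the aggregation idea being compared, the sketch below contrasts plain majority voting with reliability-weighted voting over hypothetical answers and reliability scores. The reliability estimation itself (bee colony search with a Gaussian process model, as in REBECO) is not reproduced here.

```python
# Minimal sketch of the aggregation step only: plain majority voting versus
# reliability-weighted voting. The reliability scores would come from an
# estimator such as REBECO; the numbers below are hypothetical.
from collections import Counter, defaultdict

# answers[task][worker] = label given by that worker for that task.
answers = {
    "t1": {"w1": "A", "w2": "A", "w3": "B", "w4": "B", "w5": "B"},
    "t2": {"w1": "A", "w2": "B", "w3": "B", "w4": "A", "w5": "B"},
}
# Hypothetical reliability estimates in [0, 1]; w3-w5 look spammer-like.
reliability = {"w1": 0.95, "w2": 0.90, "w3": 0.40, "w4": 0.35, "w5": 0.30}

def majority_vote(task_answers):
    """Label chosen by the largest number of workers, ignoring reliability."""
    return Counter(task_answers.values()).most_common(1)[0][0]

def weighted_vote(task_answers, reliability):
    """Label with the highest total reliability mass behind it."""
    scores = defaultdict(float)
    for worker, label in task_answers.items():
        scores[label] += reliability[worker]
    return max(scores, key=scores.get)

for task, task_answers in answers.items():
    print(task, "MV:", majority_vote(task_answers),
          "weighted:", weighted_vote(task_answers, reliability))
```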